How to do feature selection for text data?
Is PCA a FS method for text?
Other methods?
The additional features typically add noise: machine learning will pick up on spurious correlations that may hold in the training set but not in the test set.
For some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc…) – the increased space of possibilities is more difficult to search.
to improve performance (in terms of speed, predictive power, simplicity of the model).
to visualize the data for model selection.
to reduce dimensionality and remove noise.
Idea: Transform a discrete space into a continuous space.
Information gain of term \(t\):
\[ \begin{align} IG(t) = &-\sum_c{p(c)log{p(c)}} \\ &+ p(t)\sum_c{p(c|t)log{p(c|t)}} \\ &+ p(\bar t)\sum_c{p(c|\bar t)log{p(c|\bar t)}} \end{align} \]
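As a minimal sketch, the information gain above can be computed from raw document counts (the count dictionaries in the example are made-up inputs):

```python
import math

def information_gain(n_ct, n_c, n_t, n):
    """IG of term t from document counts.
    n_ct[c]: docs of class c containing t; n_c[c]: docs of class c;
    n_t: docs containing t; n: total docs.  Natural log throughout."""
    p_t = n_t / n
    # -sum_c p(c) log p(c)
    ig = -sum((nc / n) * math.log(nc / n) for nc in n_c.values())
    for c, nc in n_c.items():
        nct = n_ct.get(c, 0)
        if n_t and nct:                    # + p(t) sum_c p(c|t) log p(c|t)
            p = nct / n_t
            ig += p_t * p * math.log(p)
        if n - n_t and nc - nct:           # + p(t̄) sum_c p(c|t̄) log p(c|t̄)
            q = (nc - nct) / (n - n_t)
            ig += (1 - p_t) * q * math.log(q)
    return ig

# A term spread evenly over the classes carries no information:
information_gain({'a': 2, 'b': 2}, {'a': 5, 'b': 5}, 4, 10)  # ≈ 0
```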
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c | t)} = 1\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \([1/k, 1]\).
Higher values of the gini-index indicate a greater discriminative power of the term \(t\).
If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.
➔ normalized gini-index
\[p'(c|t) \equiv \frac{p(c|t)/p(c)}{\sum_{i=1}^k{p(i|t)/p(i)}} \]
\[G(t) \equiv \sum^k_{c=1}{p'(c|t)^2}\]
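A small sketch of the normalized gini-index, taking the estimated \(p(c|t)\) and global \(p(c)\) as plain lists (the numbers in the example are made up):

```python
def normalized_gini(p_c_given_t, p_c):
    """G(t) with class-skew correction: normalize each p(c|t) by p(c),
    renormalize to sum to 1, then sum the squares."""
    ratios = [pct / pc for pct, pc in zip(p_c_given_t, p_c)]
    total = sum(ratios)
    return sum((r / total) ** 2 for r in ratios)

normalized_gini([0.5, 0.5], [0.5, 0.5])  # uninformative term: 1/k = 0.5
normalized_gini([1.0, 0.0], [0.5, 0.5])  # perfectly discriminative: 1.0
```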
The pointwise mutual information \(M_c(t)\) between the term \(t\) and the class \(c\) is defined on the basis of the level of co-occurrence between the class \(c\) and term \(t\). Let \(p(c)\) be the unconditional probability of class \(c\), and \(p(c | t)\) be the probability of class \(c\), given that the document contains the term \(t\).
Let \(p(t)\) be the fraction of the documents containing the term \(t\), i.e. the unconditional probability of term \(t\).
The expected co-occurrence of class \(c\) and term \(t\) on the basis of mutual independence is given by \(p(c) \cdot p(t)\). The true co-occurrence is of course given by \(p(c | t) \cdot p(t)\).
\[M_c(t) = log(\frac{p(c|t) \cdot p(t)}{p(c) \cdot p(t)}) = log(\frac{p(c|t)}{p(c)})\]
\[M_{avg}(t) = \sum^k_{c=1}{p(c) \cdot M_c(t)}\] \[M_{max}(t) = \max_c{\{M_c(t)\}}\]
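The three quantities can be sketched directly from estimated probabilities (dictionaries keyed by class; assumes every \(p(c|t)>0\) so the logs are defined; the example numbers are made up):

```python
import math

def pmi(p_c_t, p_c):                 # M_c(t) = log(p(c|t) / p(c))
    return math.log(p_c_t / p_c)

def pmi_avg(p_c_given_t, p_c):       # M_avg(t) = sum_c p(c) * M_c(t)
    return sum(p_c[c] * pmi(p_c_given_t[c], p_c[c]) for c in p_c)

def pmi_max(p_c_given_t, p_c):       # M_max(t) = max_c M_c(t)
    return max(pmi(p_c_given_t[c], p_c[c]) for c in p_c)

p_c = {'pos': 0.5, 'neg': 0.5}       # hypothetical class priors
p_ct = {'pos': 0.8, 'neg': 0.2}      # hypothetical p(c|t) for some term t
```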
\[{\chi}_c^2(t) = \frac{n \cdot p(t)^2 \cdot (p(c|t) - p(c))^2}{p(t) \cdot (1- p(t)) \cdot p(c) \cdot (1 - p(c))}\]
\({\chi}^2\) statistics
Test whether distributions of two categorical variables are independent of one another
Degree of freedom = \((\#col-1) \times (\#row-1)\)
Significance level: \(\alpha\), i.e., \(p\mbox{-}value<\alpha\)
↳ Look into \({\chi}^2\) distribution table to find the threshold
For the features passing the threshold, rank them by descending order of \({\chi}^2\) values and choose the top \(k\) features
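As a sketch, the per-term test can be run with `scipy.stats.chi2_contingency` on a 2×2 term-vs-class contingency table (the counts below are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: term present / absent; columns: class c / not c (hypothetical counts)
table = np.array([[40, 10],
                  [60, 90]])
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
# dof = (#rows - 1) * (#cols - 1) = 1
# keep the term if p_value < alpha, then rank kept terms by chi2, descending
```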
\({\chi}^2\) statistics with multiple categories
\({\chi}^2_{avg}(t) = \sum_c{p(c)\,{\chi}^2(c,t)}\)
\({\chi}^2_{max}(t) = \max_c\ {\chi}^2(c,t)\)
Many other metrics (Same trick as in \(\chi^2\) statistics for multi-class cases)
Mutual information
\[PMI(t;c) = log(\frac{p(t,c)}{p(t)p(c)})\]
Odds ratio
\[Odds(t;c) = \frac{p(t|c)}{1 - p(t|c)} \times \frac{1 - p(t|\bar{c})}{p(t|\bar{c})}\]
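A one-liner for the odds ratio, written here with the conditional probabilities \(p(t|c)\) and \(p(t|\bar c)\), both assumed strictly between 0 and 1 (the example numbers are made up):

```python
def odds_ratio(p_t_pos, p_t_neg):
    """p_t_pos = p(t|c), p_t_neg = p(t|c̄); values > 1 mean t favors class c."""
    return (p_t_pos / (1 - p_t_pos)) * ((1 - p_t_neg) / p_t_neg)

odds_ratio(0.5, 0.5)   # 1.0: t is uninformative about c
odds_ratio(0.8, 0.2)   # ≈ 16: t strongly indicates c
```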
\[var(X) = \frac{\sum_{i = 1}^n{(X_i - \bar X)(X_i - \bar X)}}{(n - 1)}\] \[cov(X,Y) = \frac{\sum_{i = 1}^n{(X_i - \bar X)(Y_i - \bar Y)}}{(n - 1)}\]
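The two formulas above are exactly what `np.cov` computes; PCA then diagonalizes the covariance matrix and projects onto the leading eigenvectors. A sketch on toy data (the shapes and random data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # 100 documents, 5 features (toy)

Xc = X - X.mean(axis=0)                    # center each feature
cov = Xc.T @ Xc / (X.shape[0] - 1)         # matches the var/cov definitions

# PCA: project onto the eigenvectors of cov with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
X_reduced = Xc @ eigvecs[:, -2:]           # keep the top-2 components
```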
\[\min_{\alpha, \sigma}{\hat{R}(\alpha, \sigma)} = \min_{\alpha, \sigma}{\left[\sum_{k=1}^m{L(f(\alpha, \sigma \circ x_k), y_k)} + \Omega(\alpha)\right]}\]
Replace the regularizer \(||w||^2\) by the \(l_0\) norm \(\sum_{i=1}^n{1_{w_i \neq 0}}\)
Further replace \(\sum_{i=1}^n{1_{w_i \neq 0}}\) by \(\sum_i{log{(\epsilon + |w_i|)}}\)
Boils down to the following multiplicative update algorithm:
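The algorithm itself is not reproduced here; as a hedged sketch, one standard multiplicative scheme for the \(\sum_i{log(\epsilon + |w_i|)}\) objective alternates fitting a linear model with rescaling each feature by \(|w_i|\), so irrelevant features shrink toward zero (a plain ridge fit stands in for the inner learner, an assumption):

```python
import numpy as np

def multiplicative_update(X, y, n_iter=20):
    """Sketch: the scaling z starts at 1; each round, fit ridge weights w on
    the scaled data X*z, then set z <- z * |w|.  Features whose scale decays
    toward 0 are effectively removed."""
    z = np.ones(X.shape[1])
    for _ in range(n_iter):
        Xz = X * z                                   # sigma ∘ x: scale features
        # ridge solution: w = (Xz'Xz + I)^-1 Xz'y
        w = np.linalg.solve(Xz.T @ Xz + np.eye(X.shape[1]), Xz.T @ y)
        z = z * np.abs(w)                            # multiplicative update
    return z
```

The surviving (non-negligible) coordinates of `z` mark the selected features.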
Split data into 3 sets: training, validation, and test set.
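A minimal index-based sketch of the 3-way split (the 60/20/20 proportions are an arbitrary choice):

```python
import numpy as np

def three_way_split(n, train=0.6, val=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(train * n), int(val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(100)
# fit on train, tune / select features on validation, report on test
```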